2021-10-04 – 2021-10-08
This repository contains teaching and learning materials prepared and used during “Introduction to biostatistics and machine learning” course, organized by NBIS, National Bioinformatics Infrastructure Sweden. The course is open for PhD students, postdoctoral researcher and other employees within Swedish universities. The materials are geared towards life scientists wanting to be able to understand and use basic statistical and machine learning methods. More about the course https://uppsala.instructure.com/courses/51998/
Learning outcomes:
Data can be divided into different types; categorical and quantitative (numeric). How to summarize and analyze your data depends on the type.
Categorical data types are divided into;
Quantitative data types are divided into;
Categorical data can be summarized by counting the number of observations of each category and summarizing in a frequency table or bar plot. Alternatively, the proportions (or percentages) of each category can be calculated.
(#exm:10mice) Ten lab mice
Observe gender and weight of your ten lab mice and summarize.If you want to follow this example, you can download the data here; mice.csv. You get the subset used in this example by the following commands;
## first read the full data set into R
mice <- read.csv("mice.csv")## Then extract the specific subset used in this example
m10 <- subset(mice, subset=week==5 & id %in% 1:10, select = c(id, gender, weight))In this example we have only ten observations (mice) and the full data can actually be shown in a table.
| id | gender | weight | |
|---|---|---|---|
| 1 | 1 | male | 19 |
| 21 | 2 | male | 21 |
| 41 | 3 | female | 18 |
| 61 | 4 | male | 20 |
| 81 | 5 | male | 21 |
| 101 | 6 | male | 17 |
| 121 | 7 | female | 18 |
| 141 | 8 | male | 24 |
| 161 | 9 | male | 22 |
| 181 | 10 | female | 18 |
We are interested in the gender distribution in our group of mice. Count the frequency of male/female mice and summarize in a table. Also, the fraction or percentage can be useful.
| gender | n | percent (%) |
|---|---|---|
| female | 3 | 30 |
| male | 7 | 70 |
The frequencies can also be shown in a barplot.
ggplot(m10, aes(x=gender)) + geom_bar()
barplot(table(m10$gender))(#fig:m10barplot)The number of male and female mice shown in barplots generated using ggplot and basic R graphics.
patients: {L, L, L, R, L, R, R, L, L, R, R, R, R, R, L, R, R, R, R, R, R, R, R, R, R, L, L, R, R, R}
controls: {R, L, R, R, L, R, R, R, R, R, R, R, R, R, R, R, R, R, R, R, R, R, L, R, R, L, R, R, R, R, R, R, R, R, R, R, R, R, R, R}
Summarized as
| group | Total n | Left handed (%) |
|---|---|---|
| control | 40 | 4 (10%) |
| patient | 30 | 9 (30%) |
or in a contingency table;
| L | R | Sum | |
|---|---|---|---|
| control | 4 | 36 | 40 |
| patient | 9 | 21 | 30 |
| Sum | 13 | 57 | 70 |
Data can be summarized in barplots in several ways;
## Using ggplot to create barplots
ggplot(hand, aes(x=group, fill=handedness)) + geom_bar()
ggplot(hand, aes(x=group, fill=handedness)) + geom_bar(position="dodge")
ggplot(hand, aes(x=group, fill=handedness)) + geom_bar(position="fill") + ylab("Fraction")
## Using basic R graphics to create barplots
tab <- table(hand$handedness, hand$group)
barplot(tab)
barplot(tab, beside=TRUE)
tabperc <- tab
tabperc[,1] <- 100*tab[,1]/sum(tab[,1])
tabperc[,2] <- 100*tab[,2]/sum(tab[,2])
barplot(tabperc)(#fig:barlefthand)Left-handedness in patient and control groups.
Quantitative data (both discrete and continuous) can be visualized in a histogram;
(#fig:sum10dice)Throw 10 dice and count the total number of dots. Repeat the experiment 1000 times. This histogram summarize the results, i.e. the total number of dots when throwing 10 dice.
or a density plot;
(#fig:dicedens)Density plot over the total number of dots when throwing 10 dice.
(#fig:hist)Histogram over weight of 2000 5 weeks old mice, colored according to gender.
Summary statistics for numeric data are usually divided into measures of location and spread.
For n onservations x1, x2, …, xn, the mean value is calculated as;
$$\bar x = \frac{x_1+x_2+\dots+x_n}{n} = \frac{1}{n}\displaystyle\sum_{i=1}^n x_i$$ Note, several very different distributions can still have the same mean value.
(#fig:mean35)All these distributions have the same mean value, 3.50.
Quartiles - the three values that divide the data values into four equally sized groups.
The variance of a set of observations is their mean squared distance from the mean value;
$$\sigma^2 = \frac{1}{n} \sum_{i=1}^n (x_i - \bar x)^2.$$ the variance is measured in the square of the unit in which x was measured. a commonly used measured on the same unit as x is the standard deviation, defined as the square root of the variance;
$$\sigma = \sqrt{\frac{1}{n} \sum_{i=1}^n (x_i - \bar x)^2}$$ The denominator n is commonly replaced by n − 1 and the sample standard deviation is calculated instead;
$$s = \sqrt{\frac{1}{n-1} \sum_{i=1}^n (x_i - \bar x)^2}.$$ The latter formula is used if we regard the collection of observations x1, …, xn as a sample drawn from a large population of possible observations.
I we want to describe the variance/standard deviation only of our set of observations, the former formula should be used, calculation of a population standard deviation σ (i.e. we consider the set of observations to be the full population).
If instead, we want to estimate the variance of a larger population from which our smaller sample is drawn, we should calculate the sample standard deviation, s.
(#exr:Datasummary) Data summary
Consider the below data and summarize each of the variables. There is no need to use R here, just use pen and paper, maybe use R as a calculator.
| id | smoker | baby weight (kg) | gender | mother weight (kg) | mother age | parity | married |
|---|---|---|---|---|---|---|---|
| 1 | yes | 2.8 | F | 64 | 21 | 2 | yes |
| 2 | yes | 3.2 | M | 65 | 27 | 1 | yes |
| 3 | yes | 3.5 | F | 60 | 31 | 2 | yes |
| 4 | yes | 2.7 | F | 73 | 32 | 0 | yes |
| 5 | yes | 3.3 | M | 59 | 39 | 3 | yes |
| 6 | no | 3.7 | F | 62 | 26 | 0 | no |
| 7 | no | 3.3 | F | 52 | 27 | 2 | no |
| 8 | no | 4.3 | F | 59 | 21 | 0 | no |
| 9 | no | 3.2 | M | 65 | 28 | 1 | no |
| 10 | no | 3.0 | M | 81 | 33 | 4 | yes |
(#exr:cholesterol) Exercise in distribution of sample mean
The total cholesterol in population (mg/dL) is normally distributed with μ = 202 and σ = 40.
(#exr:pill2) Amount of active substance
The amount of active substance in a pill is stated by the manufacturer to be normally distributed with mean 12 mg and standard deviation 0.5 mg. You take a sample of five pill and measure the amount of active substance to; 13.0, 12.3, 12.6, 12.5, 12.7 mg.
@ref(exr:baby) Smokers: 5 (50%) yes baby weight (kg) mean (sd): 3.3 (0.44) gender: 6 (60%) F mother weight (kg) mean(sd): 64 (8.5) mother age mean(sd): 28.5 (5.8) partity mean(sd): 1.5 (1.4) could also be handled as categorical (ordinal) and report frequencies and percentages. married: 4 (40%) yes
Did you caompute standard deviations that were aslightly different? Then you probably computed the sample standard deviation, which could actually be what you want to report. When do you want to compute sample standard deviation?
@ref(exr:cholesterol)
@ref(exr:pill2)
Learning outcomes
Some things are more likely to occur than others. Compare:
We intuitively believe that the chance of sun rising or dark winter occurring are enormously higher than COVID-19 disappearing over night or having no rain over the entire summer. Probability gives us a scale for measuring the likeliness of events to occur. Probability rules enable us to reason about uncertain events. The probability rules are expressed in terms of sets, a well-defined collection of distinct objects.
Suppose we perform an experiment that we do not know the outcome of, i.e. we are uncertain about the outcome. We can however list all the outcomes that might occur.
Based on the axioms the following rules of probability can be proved.
Let E, F ⊆ S be two events that P(E) > 0 then the conditional probability of F given that E occurs is defined to be: $$P(F|E) = \frac{P(E\cap F)}{P(E)}$$
Product rule follows conditional probability: let E, F ⊆ S be events such that P(E) > 0 then: P(E∩F) = P(F|E)P(E) = P(E|F)P(F)
The urn model is a simple model commonly used in statistics and probability. In the urn model, real objects (such as people, mice, cells, genes, molecules, etc) are represented by balls of different colors. A fair coin can be represented by an urn with two balls representing the coins two sides. A group of people can be modelled in an urn model, if age is the variable of interest, we write the age of each person on the balls. If we instead are interested in if the people are allergic to pollen or not, we color the balls according to allergy status.
(#fig:urns)Urn models of a fair coin, age of a group of people, pollen allergy status of a group of people.
In the urn model every unit (ball) is equally likely of being selected. This means that the urn model us well suited to represent flipping a fair coin. However, a biased coin can also be modelled using an urn model, by changing the number of balls that represent each side of the coin.
By drawing balls from the urn with (or without) replacement, probabilities and other properties of the model can be inferred.
The outcome of a random experiment can be described by a random variable. Whenever chance is involved in the outcome of an experiment the outcome is a random variable.
A random variable can not be predicted exactly, but the probability of all possible outcomes can be described. The sample space is the set of all possible outcomes of a random variable. Note, the sample space is not always countable.
A random variable is usually denoted by a capital letter, X, Y, Z, …. Values collected in an experiment are observations of the random variable, usually denoted by lowercase letters x, y, z, ….
The population is the collection of all units studied.
A sample is a subset of the population.
Example random variables and probabilites:
Conditional probability can be written for example P(W≥3.5|S=1), which is the probability that X ≥ 3.5 if S = 1, in words “the probability that a smoking mother has a baby with birth weight of 3.5 kg or more”.
A discrete random number has countable number of outcome values, such as {1,2,3,4,5,6}; {red, blue, green}; {tiny, small, average, large, huge} or all integers.
A discrete random variable can be described by its probability mass function, pmf.
The probability that the random variable, X, takes the value x is denoted P(X=x) = p(x). Note that:
(#exm:rolldie) The number of dots on a die
When rolling a die the there are six possible outcomes; 1, 2, 3, 4, 5 and 6, each of which have the same probability, if the die is fair. The outcome of one dice roll can be described by a random variable X. The probability of a particular outcome x is denoted P(X=x) or p(x).The probability mass function of a fair six-sided die can be summarized in a table;
| x | 1.00 | 2.00 | 3.00 | 4.00 | 5.00 | 6.00 |
| p(x) | 0.17 | 0.17 | 0.17 | 0.17 | 0.17 | 0.17 |
or in a barplot;
(#fig:die)Probability mass function of a die.
| non-smoker | smoker | |
|---|---|---|
| x | 0 | 1 |
| p(x) | 0.61 | 0.39 |
(#exm:bacteria) CFU
The number of bacterial colonies on a plate is a random number.(#fig:CFU)Probability mass distribution of the number of bacterial colonies on an agar plate.
When the probability mass function is know the expected value of the random variable can be computed.
$$E[X] = \mu = \sum_{i=1}^N x_i p(x_i)$$ For a uniform distribution, where every object has the same probability (in the urn model, every object is represented by one ball), the expected value can be computed as the sum of all objects divided by the total number of objects;
$$E[X] = \mu = \frac{1}{N}\sum_{i=1}^N x_i$$ Linear transformations and combinations
E(aX) = aE(X)
E(X+Y) = E(X) + E(Y)
E[aX+bY] = aE[X] + bE[Y]
The variance is a measure of spread and is defined as the expected value of the squared distance from the expected value;
$$var(X) = \sigma^2 = E[(X-\mu)^2] = \sum_{i=1}^n (x_i-\mu)^2 p(x_i)$$ Linear transformations and combinations
var(aX) = a2var(X)
For independent random variables X and Y
var(aX+bY) = a2var(X) + b2var(Y)
Once the distribution is known, we can compute probabilities, such as P(X=x), P(X<x) and P(X≥x). If the distribution is not known, simulation might be the solution.
(#exm:cointoss) Simulate coin toss
In a single coin toss the probabity of heads is 0.5. In 20 coin tosses, what is the probability of at least 15 heads?The outcome of a single coin toss is a random variable, X with two possible outcomes {H, T}. We know that P(X=H) = 0.5. The random variable of interest is the number of heads in 20 coin tosses, Y. The probability that we need to compute is P(Y≥15).
(#fig:coinurn)A coin toss. Urn model with one black ball (heads) and one white ball (tails).
A single coin toss can be modelled by an urn with two balls. When a ball is drawn randomly from the urn, the probability to get the black ball (heads) is P(X=H) = 0.5.
If we want to simulate tossing 20 coins (or one coin 20 times) we can use the same urn model, if the ball is replaced after each draw.
In R we can simulate random draws from an urn model using the function sample.
# A single coin toss
sample(c("H", "T"), size=1)## [1] "H"
# Another coin toss
sample(c("H", "T"), size=1)## [1] "H"
Every time you run sample a new coin toss is simulated.
The argument size tells the function how many balls we want to draw from the urn. To draw 20 balls from the urn, set size=20, remember to replace the ball after each draw!
# 20 independent coin tosses
(coins <- sample(c("H", "T"), size=20, replace=TRUE))## [1] "H" "H" "H" "H" "T" "H" "H" "T" "T" "T" "H" "T" "H" "H" "T" "H" "T" "T" "T"
## [20] "T"
How many heads did we get in the 20 random draws?
# How many heads?
sum(coins == "H")## [1] 10
We can repeat this experiment (toss 20 coins and count the number of heads) several times to estimate the distribution of number of heads in 20 coin tosses.
To do the same thing several times we use the function replicate.
To simulate tossing 20 coins and counting the number of heads 10000 times, do the following;
Nheads <- replicate(10000, {
coins <- sample(c("H", "T"), size=20, replace=TRUE)
sum(coins == "H")
})Plot distribution of the number of heads in a histogram.
hist(Nheads, breaks=0:20-0.5)Now, let’s get back to the question; when tossing 20 coins, what is the probability of at least 15 heads?
P(X≥15)
Count how many times out of our 10000 exeriments the number is 15 or greater
sum(Nheads >= 15)## [1] 200
From this we conclude that
P(X≥15)= 200/10000 = 0.02
A Bernoulli trial is a random experiment with two outcomes; success and failure. The probability of success, P(success) = p, is constant. The probability of failure is P(failure) = 1 − p.
When coding it is convenient to code success as 1 and failure as 0.
The outcome of a Bernoulli trial is a discrete random variable, X.
$$p(x) = \left\{ \begin{array}{ll} p & \mathrm{if}\,x=1\mathrm,\,success\\ 1-p & \mathrm{if}\,x=0\mathrm,\,failure \end{array} \right.$$
Using the definitions of expected value and variance it can be shown that;
$$E[X] = p\\ var(X) = p(1-p)$$
The number of successes in a series of independent and identical Bernoulli trials is a discrete random variable, X.
$X = \sum_{i=0}^n Z_i,$
where all Zi describe the outcome of independent and identical Bernoilli trials with probability p for success (P(Zi=1) = p).
The probability mass function of X is called the binomial distribution. In short we use the notation;
X ∈ Bin(n,p)
The probability mass function is
$$P(X=k) = {n \choose k} p^k (1-p)^{n-k}$$ It can be shown that
$$E[X] = np\\ var(X) = np(1-p)$$
The binomial distribution occurs when sampling n objects with replacement from an urn with objects of two types, of which the interesting type has probability p.
The probability mass function, P(X=k) can be computed using the R function dbinom and the cumulative distribution function P(X≤k) can be computed using pbinom.
The hypergeometric distribution occurs when sampling n objects without replacement from an urn with N objects of two types, of which the interesting type has probability p.
The probability density function
$$P(X=k) = \frac{{Np \choose x} {N-Np \choose n-x}}{N \choose n}$$ can be computed in R using dhyper and the cumulative distribution function P(X≤k) can be computed using phyper.
The Poisson distribution describe the number of times a rare event occurs in a large number of trials.
A rare disease has a very low probability for a single individual. The number of individuals in a large population that catch the disease in a certain time period can be modelled using the Poisson distribution.
The probability mass function;
$$P(X=k) = \frac{\mu}{k!}e^{-\mu},$$ where μ is the expected value, which is μ = nπ, where n is the number of objects sampled from the population and π is the probability of a single object.
The Poisson distribution can approximate the binomial distribution if n is large (n > 10) and π is small (π < 0.1).
Probability mass functions, P(X=x), for the binomial, hypergeometric and Poisson distributions can in R can be computed using functions dbinom, dhyper, and dpois, respectively.
Cumulative distribution functions, P(X≤x) can be computed using pbinom, phyper and ppois.
Also, functions for computing an x such that P(X≤x) = q, where q is a probability of interest are available using qbinom, qhyper, and qpois.
(#exr:probdie) When tossing a fair six-sided die
(#exr:Cointoss) In a single coin toss the probability of heads is 0.5.
In 20 coin tosses,
(#exr:Dice) When rolling 10 six-sided dice, study the number of sixes.
(#exr:GSEA) \iffalse (Gene set enrichment analysis) You have analyzed 20000 genes and a bioinformatician you are collaborating with has sent you a list of 1000 genes that she says are important. You are interested in a particular pathway A. 200 genes in pathway A are represented among the 20000 genes, 20 of these are in the bioinformaticians important list.
If the bioinformatician selected the 1000 genes at random, what is the probability to see 20 or more genes from pathway A in this list?| pos | neg | tot | |
|---|---|---|---|
| not cancer | 98 | 882 | 980 |
| cancer | 16 | 4 | 20 |
| total | 114 | 886 | 1000 |
the probability of a positive test result from a person with cancer?
the probability of a negative test result from a person without cancer?
the probability of having cancer, if the test is positive?
the probability of not having cancer, if the test is negative?
Connect the four computed probabilities with the following four terms;
@ref(exr:probcoin)
@ref(exr:probdie)
@ref(exr:Cointoss)
Simulate as in the lecture;
# A single coin toss
sample(c("H", "T"), size=1)## [1] "T"
# Another coin toss
sample(c("H", "T"), size=1)## [1] "H"
# 20 independent coin tosses
(coins <- sample(c("H", "T"), size=20, replace=TRUE))## [1] "H" "H" "T" "T" "T" "T" "H" "T" "H" "H" "H" "T" "T" "H" "T" "T" "H" "H" "T"
## [20] "H"
# How many heads did we get in these particular 20 draws?
sum(coins == "H")## [1] 10
## The simulation is about repeating this (20 random draws and summing up the number of heads) many times. To do it 10000 times;
Nheads <- replicate(10000, {
coins <- sample(c("H", "T"), size=20, replace=TRUE)
sum(coins == "H")
})## Numer of times of the 10000 with exactly 15 heads
sum(Nheads==15)## [1] 156
## divide by 10000 to get the probability
sum(Nheads==15)/10000## [1] 0.016
## or compute using mean (why does this work?)
mean(Nheads==15)## [1] 0.016
mean(Nheads<7)## [1] 0.055
## plot the distribution and read the graph
hist(Nheads, breaks=0:20-0.5)## or tabulate
table(Nheads)## Nheads
## 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
## 8 37 145 359 718 1188 1662 1751 1616 1211 710 387 156 41 9 2
To get five or less tails out of 20 throws is equal to getting 15 or more heads out of 20.
## probability of 15 heads or more
mean(Nheads>=15)## [1] 0.021
mean(Nheads<=2)## [1] 0
sum(Nheads<=2)## [1] 0
## with this low number of observations, more repeats is required to get a more accurate answer
Nheads <- replicate(1000000, {
coins <- sample(c("H", "T"), size=20, replace=TRUE)
sum(coins == "H")
})
sum(Nheads<=2)## [1] 215
mean(Nheads<=2)## [1] 0.00021
@ref(exr:Dice)
N <- replicate(100000, sum(sample(1:6, size=10, replace=TRUE)==6))
table(N)## N
## 0 1 2 3 4 5 6 7 8
## 16141 32153 29094 15502 5478 1366 234 30 2
hist(N, breaks=(0:11)-0.5)## [1] 1632
## [1] 0.016
## [1] 0.016
mean(N==2)## [1] 0.29
mean(N)## [1] 1.7
10*1/6## [1] 1.7
@ref(exr:Cards)
0.00049
N <- replicate(100000, sum(sample(rep(0:1, c(39,13)), size=5)))
hist(N, breaks=(0:6)-.5)# P(N==5)
mean(N==5)## [1] 0.00039
@ref(exr:Pollen)
## Solution using 100 replicates
x <- replicate(100, sum(sample(c(0,0,0,0,0,0,0,1,1,1), size=3, replace=TRUE)))
table(x)## x
## 0 1 2 3
## 42 43 13 2
mean(x==0)## [1] 0.42
## Solution using 1000 replicates
x <- replicate(1000, sum(sample(c(0,0,0,0,0,0,0,1,1,1), size=3, replace=TRUE)))
table(x)## x
## 0 1 2 3
## 358 429 193 20
mean(x==0)## [1] 0.36
## Solution using 100000 replicates
x <- replicate(100000, sum(sample(c(0,0,0,0,0,0,0,1,1,1), size=3, replace=TRUE)))
table(x)## x
## 0 1 2 3
## 34253 44113 18927 2707
mean(x==0)## [1] 0.34
@ref(exr:Pollen2)
## Solution using 100000 replicates
x <- replicate(100000, sum(sample(rep(c(0, 1), c(14, 6)), size=3, replace=FALSE)))
table(x)## x
## 0 1 2 3
## 32080 47809 18368 1743
mean(x==0)## [1] 0.32
@ref(exr:Pollen3)
## Solution using 100000 replicates
x <- replicate(100000, sum(sample(rep(c(0, 1), c(140, 60)), size=3, replace=FALSE)))
table(x)## x
## 0 1 2 3
## 34182 44396 18785 2637
mean(x==0)## [1] 0.34
@ref(exr:pollenparam)
## 1.6 Solution using the Binomial distribution
pbinom(0, 3, 0.3)## [1] 0.34
## 1.7 Solution using the hypergeometric distribution
phyper(0, 6, 20-6, 3)## [1] 0.32
## 1.8 Solution using the hypergeometric distribution
phyper(0, 66, 200-60, 3)## [1] 0.31
@ref(exr:GSEA)
phyper(20, 200, 20000-200, 1000, lower.tail=FALSE)## [1] 0.0011
@ref(exr:diagnostictests)
A continuous random number is not limited to discrete values, but any continuous number within one or several ranges is possible.
Examples: weight, height, speed, intensity, …
A continuous random variable can be described by its probability density function, pdf.
(#fig:unnamed-chunk-32)Probability density function of the weight of a newborn baby.
The probability density function, f(x), is defined such that the total area under the curve is 1.
∫−∞∞f(x)dx = 1
The area under the curve from a to b is the probability that the random variable X takes a value between a and b.
P(a≤X≤b) = ∫abf(x)dx
The cumulative distribution function, cdf, sometimes called just the distribution function, F(x), is defined as:
F(x) = P(X≤x) = ∫−∞xf(x)dx
P(X≤x) = F(x)
As we know that the total probability (over all x) is 1, we can conclude that
P(X>x) = 1 − F(x) and thus
P(a<X≤b) = F(b) − F(a)
Two important parameters of a distribution is the expected value, μ, that describe the distributions location and the variance, σ2, that describe the spread.
The expected value, or population mean, is defined as;
E[X] = μ = ∫−∞∞xf(x)dx We will learn more about the expected value and how to estimate a population mean from a sample later in the course.
The variance is defined as the expected value of the squared distance from the population mean;
σ2 = E[(X−μ)2] = ∫−∞∞(x−μ)2f(x)dx
The square root of the variance is called the standard deviation, σ.
The normal distribution (sometimes referred to as the Gaussian distribution) is a common probability distribution and many continuous random variables can be described by the normal distribution or be approximated by the normal distribution.
The normal probability density function
$$f(x) = \frac{1}{\sqrt{2 \pi} \sigma} e^{-\frac{1}{2} \left(\frac{x-\mu}{\sigma}\right)^2}$$
describes the distribution of a normal random variable, X, with expected value μ and standard deviation σ, e and π are two common mathematical constants, e ≈ 2.71828 and π ≈ 3.14159.
In short we write X ∼ N(μ,σ).
The bell-shaped normal distributions is symmetric around μ and f(x) → 0 as x → ∞ and as x → − ∞.
As f(x) is well defined, values for the cumulative distribution function F(x) = ∫−∞xf(x)dx can be computed.
If X is normally distributed with expected value μ and standard deviation σ we write:
X ∼ N(μ,σ)
Using transformation rules we can define
$$Z = \frac{X-\mu}{\sigma}, \, Z \sim N(0,1)$$
Values for the cumulative standard normal distribution, F(z), are tabulated and easy to compute in R using the function pnorm.
(#fig:FZ)The shaded area under hte curve is the tabulated value P(Z≤z) = F(z).
| 0 | 0.01 | 0.02 | 0.03 | 0.04 | 0.05 | 0.06 | 0.07 | 0.08 | 0.09 | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0.0 | 0.5000 | 0.5040 | 0.5080 | 0.5120 | 0.5160 | 0.5199 | 0.5239 | 0.5279 | 0.5319 | 0.5359 |
| 0.1 | 0.5398 | 0.5438 | 0.5478 | 0.5517 | 0.5557 | 0.5596 | 0.5636 | 0.5675 | 0.5714 | 0.5753 |
| 0.2 | 0.5793 | 0.5832 | 0.5871 | 0.5910 | 0.5948 | 0.5987 | 0.6026 | 0.6064 | 0.6103 | 0.6141 |
| 0.3 | 0.6179 | 0.6217 | 0.6255 | 0.6293 | 0.6331 | 0.6368 | 0.6406 | 0.6443 | 0.6480 | 0.6517 |
| 0.4 | 0.6554 | 0.6591 | 0.6628 | 0.6664 | 0.6700 | 0.6736 | 0.6772 | 0.6808 | 0.6844 | 0.6879 |
| 0.5 | 0.6915 | 0.6950 | 0.6985 | 0.7019 | 0.7054 | 0.7088 | 0.7123 | 0.7157 | 0.7190 | 0.7224 |
| 0.6 | 0.7257 | 0.7291 | 0.7324 | 0.7357 | 0.7389 | 0.7422 | 0.7454 | 0.7486 | 0.7517 | 0.7549 |
| 0.7 | 0.7580 | 0.7611 | 0.7642 | 0.7673 | 0.7704 | 0.7734 | 0.7764 | 0.7794 | 0.7823 | 0.7852 |
| 0.8 | 0.7881 | 0.7910 | 0.7939 | 0.7967 | 0.7995 | 0.8023 | 0.8051 | 0.8078 | 0.8106 | 0.8133 |
| 0.9 | 0.8159 | 0.8186 | 0.8212 | 0.8238 | 0.8264 | 0.8289 | 0.8315 | 0.8340 | 0.8365 | 0.8389 |
| 1.0 | 0.8413 | 0.8438 | 0.8461 | 0.8485 | 0.8508 | 0.8531 | 0.8554 | 0.8577 | 0.8599 | 0.8621 |
| 1.1 | 0.8643 | 0.8665 | 0.8686 | 0.8708 | 0.8729 | 0.8749 | 0.8770 | 0.8790 | 0.8810 | 0.8830 |
| 1.2 | 0.8849 | 0.8869 | 0.8888 | 0.8907 | 0.8925 | 0.8944 | 0.8962 | 0.8980 | 0.8997 | 0.9015 |
| 1.3 | 0.9032 | 0.9049 | 0.9066 | 0.9082 | 0.9099 | 0.9115 | 0.9131 | 0.9147 | 0.9162 | 0.9177 |
| 1.4 | 0.9192 | 0.9207 | 0.9222 | 0.9236 | 0.9251 | 0.9265 | 0.9279 | 0.9292 | 0.9306 | 0.9319 |
| 1.5 | 0.9332 | 0.9345 | 0.9357 | 0.9370 | 0.9382 | 0.9394 | 0.9406 | 0.9418 | 0.9429 | 0.9441 |
| 1.6 | 0.9452 | 0.9463 | 0.9474 | 0.9484 | 0.9495 | 0.9505 | 0.9515 | 0.9525 | 0.9535 | 0.9545 |
| 1.7 | 0.9554 | 0.9564 | 0.9573 | 0.9582 | 0.9591 | 0.9599 | 0.9608 | 0.9616 | 0.9625 | 0.9633 |
| 1.8 | 0.9641 | 0.9649 | 0.9656 | 0.9664 | 0.9671 | 0.9678 | 0.9686 | 0.9693 | 0.9699 | 0.9706 |
| 1.9 | 0.9713 | 0.9719 | 0.9726 | 0.9732 | 0.9738 | 0.9744 | 0.9750 | 0.9756 | 0.9761 | 0.9767 |
| 2.0 | 0.9772 | 0.9778 | 0.9783 | 0.9788 | 0.9793 | 0.9798 | 0.9803 | 0.9808 | 0.9812 | 0.9817 |
| 2.1 | 0.9821 | 0.9826 | 0.9830 | 0.9834 | 0.9838 | 0.9842 | 0.9846 | 0.9850 | 0.9854 | 0.9857 |
| 2.2 | 0.9861 | 0.9864 | 0.9868 | 0.9871 | 0.9875 | 0.9878 | 0.9881 | 0.9884 | 0.9887 | 0.9890 |
| 2.3 | 0.9893 | 0.9896 | 0.9898 | 0.9901 | 0.9904 | 0.9906 | 0.9909 | 0.9911 | 0.9913 | 0.9916 |
| 2.4 | 0.9918 | 0.9920 | 0.9922 | 0.9925 | 0.9927 | 0.9929 | 0.9931 | 0.9932 | 0.9934 | 0.9936 |
| 2.5 | 0.9938 | 0.9940 | 0.9941 | 0.9943 | 0.9945 | 0.9946 | 0.9948 | 0.9949 | 0.9951 | 0.9952 |
| 2.6 | 0.9953 | 0.9955 | 0.9956 | 0.9957 | 0.9959 | 0.9960 | 0.9961 | 0.9962 | 0.9963 | 0.9964 |
| 2.7 | 0.9965 | 0.9966 | 0.9967 | 0.9968 | 0.9969 | 0.9970 | 0.9971 | 0.9972 | 0.9973 | 0.9974 |
| 2.8 | 0.9974 | 0.9975 | 0.9976 | 0.9977 | 0.9977 | 0.9978 | 0.9979 | 0.9979 | 0.9980 | 0.9981 |
| 2.9 | 0.9981 | 0.9982 | 0.9982 | 0.9983 | 0.9984 | 0.9984 | 0.9985 | 0.9985 | 0.9986 | 0.9986 |
| 3.0 | 0.9987 | 0.9987 | 0.9987 | 0.9988 | 0.9988 | 0.9989 | 0.9989 | 0.9989 | 0.9990 | 0.9990 |
| 3.1 | 0.9990 | 0.9991 | 0.9991 | 0.9991 | 0.9992 | 0.9992 | 0.9992 | 0.9992 | 0.9993 | 0.9993 |
| 3.2 | 0.9993 | 0.9993 | 0.9994 | 0.9994 | 0.9994 | 0.9994 | 0.9994 | 0.9995 | 0.9995 | 0.9995 |
| 3.3 | 0.9995 | 0.9995 | 0.9995 | 0.9996 | 0.9996 | 0.9996 | 0.9996 | 0.9996 | 0.9996 | 0.9997 |
| 3.4 | 0.9997 | 0.9997 | 0.9997 | 0.9997 | 0.9997 | 0.9997 | 0.9997 | 0.9997 | 0.9997 | 0.9998 |
Some value of particular interest:
$$F(1.64) = 0.95\\ F(1.96) = 0.975$$
As the normal distribution is symmetric F(-z) = 1 - F(z)
$$F(-1.64) = 0.05\\ F(-1.96) = 0.025$$
P(−1.96<Z<1.96) = 0.95
If X ∼ N(μ1,σ1) and Y ∼ N(μ2,σ2) are two independent normal random variables, then their sum is also a random variable:
$$X + Y \sim N(\mu_1 + \mu_2, \sqrt{\sigma_1^2 + \sigma_2^2})$$
and
$$X - Y \sim N(\mu_1 - \mu_2, \sqrt{\sigma_1^2 + \sigma_2^2})$$ This can be extended to the case with n independent and identically distributed random varibles Xi (i = 1…n). If all Xi are normally distributed with mean μ and standard deviation σ, Xi ∈ N(μ,σ), then the sum of all n random variables will also be normally distributed with mean nμ and standard deviation $\sqrt{n} \sigma$.
As a result of central limit theorem, the distribution of fractions or mean values of a sample follow the normal distribution, at least if the sample is large enough (a rule of thumb is that the sample size n > 30).
(#exm:BMdistr) Mean BMI
In a population of 252 men we can study the distribution of BMI.##Population mean
(mu <- mean(fat$BMI))## [1] 25
##Population variance
(sigma2 <- var(fat$BMI)/nrow(fat)*(nrow(fat)-1))## [1] 13
##Population standard variance
(sigma <- sqrt(sigma2))## [1] 3.6
Randomly sample 3, 5, 10, 15, 20, 30 men and compute the mean value, m. Repeat many times to get the distribution of mean values.
Note, mean is just the sum divided by the number of samples n.
The random variable $Y = \sum_{i=1}^n X_i^2$ is χ2 distributed with n − 1 degrees of freedom, if Xi are independent identically distributed random variables Xi ∈ N(0,1).
In short Y ∈ χ2(n−1).
(#fig:Xdistr)The χ2-distribution.
The ratio of two χ2-distributed variables divided by their degrees of freedom is F-distributed
(#fig:Fdistr)The F-distribution
The ratio of a normally distributed variable and a χ2-distributed variable is t-distributed.
(#fig:exampletdistr)The t-distribution.
(#exm:tdistr) t-distribution
The ratio between sample mean and sample variance is t-distributed.Probability density functions for the normal, t, χ2 and F distributions can in R can be computed using functions dnorm, dt, dchisq, and df, respectively.
Cumulative distribution functions can be computed using pnorm, pt, pchisq and pf.
Also, functions for computing an x such that P(X<x) = q, where q is a probability of interest are available using qnorm, qt, qchisq and qf.
(#exr:normtable) \iffalse (Exercise on using the normal table ) Let Z ∼ N(0,1) be a standard normal random variable, and compute;
pnormand qnorm.
(#exr:ztransform) \iffalse (Exercise in standardization/transformation) If X ∼ N(3,2), compute the probabilities
(#exr:poisson) A rare disease affects 3 in 100000 in a large population. If 10000 people are randomly selected from the population, what is the probability
@ref(exr:normtable)
@ref(exr:ztransform)
@ref(exr:sumdistr)
pnorm(158, mean=188, sd=14)## [1] 0.016
@ref(exr:poisson) a)
n <- 10000
p <- 3/100000
ppois(0, n*p)## [1] 0.74
ppois(1, n*p, lower.tail=FALSE)## [1] 0.037
In many (most) experiments it is not feasible (or even possible) to examine the entire population. Instead we study a random sample.
A random sample is a random subset of individuals from a population.
A simple random sample is a random subset of individuals from a population, where every individual has the same probability of being choosen.
Let every individual in the population be represented by a ball. The value on each ball is the measurement we are interested in, for example height, shoe size, hair color, healthy/sick, type of cancer/no cancer, blood glucose value, etc.
Draw n balls from the urn, without replacement, to get a random sample of size n.
Summary statistics can be computed for a sample, such as the sum, proportion, mean and variance.
Notation: A random sample x1, x2, …, xn from a distribution, D, consists of n observations of the independent random variables X1, X2, …, Xn all with the distribution D.
The proportion of a population with a particular property is π.
The number of individuals with the property in a simple random sample of size n is a random variable X. The proportion of individuals in a sample with the property is also a random variable;
$$P = \frac{X}{n}$$ with expected value $$E[P] = \frac{E[X]}{n} = \frac{n\pi}{n} = \pi$$
For a particular sample of size n; x1, …, xn, the sample mean is denoted m = x̄. The sample mean is calculated as;
$$m = \bar x = \frac{1}{n}\displaystyle\sum_{i=1}^n x_i$$ and the sample variance as;
$$s^2 = \frac{1}{n-1} \sum_{i=1}^n (x_i-m)^2$$
Note that the mean of n independent identically distributed random variables, Xi is itself a random variable;
$$\bar X = \frac{1}{n}\sum_{i=1}^n X_i,$$ If Xi ∼ N(μ,σ) then $\bar X \sim N\left(\mu, \frac{\sigma}{\sqrt{n}}\right)$.
When we only have a sample of size n, the sample mean m is our best estimate of the population mean. It is possible to show that the sample mean is an unbiased estimate of the population mean, i.e. the average (over many size n samples) of the sample mean is μ.
$$E[\bar X] = \frac{1}{n} n E[X] = E[X] = \mu$$
Similarly, the sample variance is an unbiased estimate of the population variance.
Eventhough the sample can be used to calculate unbiased estimates of the population value, the sample estimate will not be perfect. The standard deviation of the sampling distribution (the distribution of sample estimates) is called the standard error.
For the sample mean, X̄, the variance is
$$E[(\bar X - \mu)] = var(\bar X) = var(\frac{1}{n}\sum_i X_i) = \frac{1}{n^2} \sum_i var(X_i) = \frac{1}{n^2} n var(X) = \frac{\sigma^2}{n}$$ The standard error of the mean is thus;
$$SEM = \frac{\sigma}{\sqrt{n}}$$ Replacing σ with the sample standard deviation, s, we get an estimate of the standard deviation of the mean;
$$SEM \approx \frac{s}{\sqrt{n}}$$ An alternative definition of standard error of the mean is actually
$$SEM = \frac{s}{\sqrt{n}}$$
Learning outcomes:
Statistical inference is to draw conclusions regarding properties of a population based on observations of a random sample from the population.
To perform a hypothesis test is to evaluate a hypothesis based on a random sample.
Typicaly, the hypotheses that are tested are assumptions about properties of the population, such as proportion, mean, mean difference, variance etc.
There are two hypotheses involved in a hypothesis test, the null hypothesis, H0, and the alternative hypothesis, H1.
The null hypothesis is in general neutral, “no change”, “no difference between groups”, “no association”. In general we want to show that H0 is false.
The alternative hypothesis expresses what the researcher is interested in “the treatment has an effect”, “there is a difference between groups”, “there is an association”. The alternative hypothesis can also be directional “the treatment has a positive effect”.
(#def:samplingdistribution) Sampling distribution
A sampling distribution is the distribution of a sample statistic. The samplling distribution can be obtained by drawing a large number of samples from a specific population.(#def:nulldistribution) Null distribution
The null distribution is a sampling distribution when the null hypothesis is true.(#fig:examplenull)A null distribution
(#def:pvalue) p-value
The p-value is the probability of the observed value, or something more extreme, if the null hypothesis is true.(#fig:examplepval)The p-value is the probability to observe xobs or something more extreme, if the null hypothesis is true.
| H0 is true | H0 is false | |
| Accept H0 | Type II error, miss | |
| Reject H0 | Type I error, false alarm |
The significance level, α = P(false alarm) = P(Reject H0|H0 is true).
The significance level is the risk to of false alarm, i.e. to say “I have a hit”, “I found a difference”, when the the null hypothesis (“there is no difference”) is true. The risk of false alarm is control by setting the significance level to a disired value. We do want to keep the risk of false alarm (type I error) low, but at the same time we don’t want to many missed hits (type II error).
The significance level should be set before the hypothesis test is performed. Common values to use are 0.05 or 0.01.
If the p-value is above the significance level, H0 is accepted.
If the p-value is below the significance level, H0 is rejected.
In these examples the significance level is set to 0.05.
(#exm:simpollentest) Pollen allergy
Let’s assume we know that the proportion of pollen allergy in Sweden is 0.3. We suspect that the number of pollen allergic has increased in Uppsala in the last couple of years and want to investigate this.
Observe 100 people from Uppsala, 42 of these were allergic to pollen. Is there a reason to believe that the proportion of pollen allergic in Uppsala π > 0.3?H0: The proportion of pollen allergy in Uppsala is the same as in Sweden as a whole.
H1: The proportion of pollen allergy in Uppsala is not the same as in Sweden as a whole.
or expressed differently;
H0: π = π0
H1: π > π0 where π is the unknown proportion of pollen allergy in the Uppsala population that. π0 = 0.3 is the proportion of pollen allergy in Sweden.
Here we are interested in the proportion of pollen allergic in Uppsala. An appropriate test statistic could be the number of pollen allergic in a sample of size n = 100, X. As an alternative we can use the proportion of pollen allergic in a sample of size n,
$$P = \frac{X}{n}$$
Let’s use P as our test statistic and compute the observed value, pobs. In our sample of 100 people from Uppsala the proportion allergic to pollen is p = 42/100 = 0.42.
The sampling distribution of P under H0 (i.e. when the null hypothesis is true) is what we call the null distribution.
H0 state that π = 0.3. We can model this using an urn model as follows;
(#fig:pollenurn)An urn model of the null hypothesis π = 0.3. The black balls represent allergic and the white balls non-allergic.
Using this model, we can simulate taking a sample of size 100 many times.
## Urn
rep(c(0, 1), c(7, 3))## [1] 0 0 0 0 0 0 0 1 1 1
## Sample 100 times with replacement
sample(rep(c(0, 1), c(7, 3)), 100, replace=TRUE)## [1] 0 0 0 1 0 0 1 1 1 0 0 1 0 0 0 0 1 0 1 1 0 1 0 1 0 0 0 0 1 0 1 1 0 0 0 0 1
## [38] 0 1 0 0 0 0 1 0 0 0 0 1 1 0 0 0 1 0 0 1 0 1 0 1 0 0 1 1 0 0 0 1 1 1 0 0 0
## [75] 0 0 0 0 0 0 0 1 1 0 0 0 1 1 1 0 1 0 1 1 1 0 1 0 0 0
## Compute proportion of samples that are allergic (1)
sum(sample(rep(c(0, 1), c(7, 3)), 100, replace=TRUE))/100## [1] 0.3
## Draw samples of size 100 and compute proporion allergic 100000 times
p <- replicate(100000, mean(sample(rep(c(0, 1), c(7, 3)), 100, replace=TRUE)))Finally plot the distribution
(#fig:pollensampledistr)The sampling distribution.
Compare the observed value, pobs = 0.42 to the null distribution.
(#fig:unnamed-chunk-41)The sampling distribution. The observed value is marked by a red vertical line.
The p-value is the probability of getting the observed value or higher, if the null hypothesis is true.
Use the null distribution to calculate the p-value, P(P≥0.42|H0).
## How many times
sum(p >= 0.42)## [1] 689
## p-value
sum(p >= 0.42)/length(p)## [1] 0.00689
p = P(P≥0.42|H0) = 0.00689
(#exm:permutationtest) Permutation test: Do high fat diet lead to increased body weight?
Study setup:
The observed values, mouse weights in grams, are summarized below;
| high-fat | 25 | 30 | 23 | 18 | 31 | 24 | 39 | 26 | 36 | 29 | 23 | 32 |
| ordinary | 27 | 25 | 22 | 23 | 25 | 37 | 24 | 26 | 21 | 26 | 30 | 24 |
$$ \begin{aligned} H_0: \mu_2 = \mu_1 \iff \mu_2 - \mu_1 = 0\\ H_1: \mu_2>\mu_1 \iff \mu_2-\mu_1 > 0 \end{aligned} $$
where μ2 is the (unknown) mean body weight of the high-fat mouse population and μ1 is the mean body-weight of the control mouse population.
Studied population: Female mice that can be ordered from a lab.
Here we are interested in the mean difference between high-fat and control mice.
Mean weight of 12 (randomly selected) mice on ordinary diet, X̄1. E[X̄1] = E[X1] = μ1
Mean weight of 12 (randomly selected) mice on high-fat diet, X̄2. E[X̄2] = E[X2] = μ2
The mean difference is also a random variable: D = X̄2 − X̄1
Observed values;
## 12 HF mice
xHF <- c(25, 30, 23, 18, 31, 24, 39, 26, 36, 29, 23, 32)
## 12 control mice
xN <- c(27, 25, 22, 23, 25, 37, 24, 26, 21, 26, 30, 24)
##Compute mean body weights of the two samples
mHF <- mean(xHF)
mN <- mean(xN)
## Compute mean difference
dobs <- mHF - mNMean weight of control mice (ordinary diet): x̄1 = 25.83
Mean weight of mice on high-fat diet: x̄2 = 28.00
Difference in mean weights: dobs = x̄2 − x̄1 = 2.17
If high-fat diet has no effect, i.e. if H0 was true, the result would be as if all mice were given the same diet. What can we expect if all mice are fed with the same type of food?
This can be accomplished using permutation
The 24 mice were initially from the same population, depending on how the mice are randomly assigned to high-fat and normal group, the mean weights would differ, even if the two groups were treated the same.
Assume H0 is true, i.e. assume all mice are equivalent and
If we repeat 1-2 many times we get the sampling distribution when H0 is true, the so called null distribution, of difference in mean weights.
## All 24 body weights in a vector
x <- c(xHF, xN)
## Mean difference
(dobs <- mean(x[1:12]) - mean(x[13:24]))## [1] 2.167
## Permute once
(y <- sample(x))## [1] 29 36 39 31 26 24 25 23 25 27 32 37 30 23 30 18 24 23 24 26 22 26 21 25
##Compute mean difference
mean(y[1:12]) - mean(y[13:24])## [1] 5.167
dnull.perm <- replicate(n = 100000, {
y <- sample(x)
##Mean difference
mean(y[1:12]) - mean(y[13:24])
})
ggplot(data.frame(d=dnull.perm), aes(x=d)) +
geom_histogram(bins=25, color="white") +
theme_bw() +
geom_vline(xintercept=dobs, color="red")##Alternatively plot using hist
## hist(dnull.perm)What is the probability to get an at least as extreme mean difference as our observed value, dobs, if H0 was true?
## Compute the p-value
sum(dnull.perm>dobs)/length(dnull.perm)## [1] 0.1492
mean(dnull.perm>dobs)## [1] 0.1492
$P(X_2 - X_1 d_{obs} | H_0) = $ 0.168
Conclusion?
(#exr:pollentest) You believe that the proportion of Swedish students allergic to pollen is greater than 0.3 (the proportion allergic to pollen in Sweden). To test this you observe 20 people in a student group at BMC in Uppsala, 9 or them are allergic to pollen. Is this reason to believe that the proportion of Swedish students allergic to pollen i greater than 0.3?
Can you identify any problems with this study setup?(#exr:diet) A diet study aims to study how the hemoglobin (Hb) levels in blood are affected by an iron-rich diet consisting of tofu, soybeans, broccoli, lentils and peas. To perform the study the dietician has recruited 40 male participants, who are randomly assigned to the iron-rich diet or control group (no change in participants diet), 20 participant in each group.
The observed Hb levels (in g/L);ctrl <- c(197, 186, 157, 170, 193, 188, 175, 186, 177, 191, 168, 193, 191, 189, 188, 192, 179, 186, 197, 203)
iron <- c(187, 218, 196, 210, 206, 178, 181, 193, 172, 202, 169, 221, 183, 222, 185, 174, 192, 192, 162, 211)Perform a hypothesis test to investigate if the Hb level is affected (increased or decreased) by the iron-rich diet.
In previous chapters we have computed the sampling distribution using resampling techniques to be able to perform hypothesis tests or compute interval estimates. If the null distribution was already known (or could be computed based on a few assumptions) resampling would not be necessary.
We can follow the same steps as before to perform a hypothesis test:
(#exm:parampollen) Let’s get back to the pollen example!
Assume that the proportion of pollen allergy in Sweden is known to be 0.3. Observe 100 people from Uppsala, 42 of these were allergic to pollen. Is there a reason to believe that the proportion of pollen allergic in Uppsala π > 0.3?The number of allergic individuals in a sample of size n is X and the proportion of allergic persons is P = X/n. X is binomially distributed, but here we can use the Central limit theorem, see @ref(thm:CLT).
As a result of the central limit theorem, the distribution of number or proportion of allergic individuals in a sample of size n is approximately normal. At least if the sample is large enough. A rule of thumb is that the sample size should be n > 30.
Here, the sample size is 100!
The normal distribution has two parameters, mean and standard deviation.
From the binomial distribution we know that E[X] = π and var(X) = nπ(1−π). Hence E[P] = π and $var(P) = \frac{\pi(1-\pi)}{n}$.
The standard error is thus
$$SE=\sqrt{\frac{\pi(1-\pi)}{n}}$$
When the null hypothesis is true π is known and π = 0.3.
Actually these calculations are true in general when a proportion in one sample is compared to a known value.
$$H_0: \pi=\pi_0 \\ H_1: \pi>\pi_0 $$
Ather potential alternative hypothesis are H1 : π < π0 or H1 : π ≠ π0, but in this particular example we are only interested in the alternative that π > π0.
If H0 is true π = π0 and
$$P \sim N\left(\pi_0, \sqrt{\frac{\pi_0(1-\pi_0)}{n}}\right)$$ An appropriate test statistic is
$$Z = \frac{P-\pi_0}{\sqrt{\frac{\pi_0(1-\pi_0)}{n}}}$$
Z ∈ N(0,1) which makes probabilities easy to compute.
Back to our example, replace P with our observed value p = 0.42 and π0 = 0.3 and compute our observed
$$Z_{obs} = \frac{0.42-0.3}{\sqrt{\frac{0.3(1-0.3)}{100}}} = 2.62$$
The p-value is the probability of the observed value, or something more extreme, if the null hypothesis is true. If the computed probability is below α = 0.05 our significance threshold, H0 will be rejected.
p = P(P≥π0) = P(Z≥Zobs) = P(Z≥2.62) = 1 − P(Z≤2.62) = [table] = 1 − 0.996 = 0.0044
As 0.0044<0.05 we reject H0 and conclude that there is reason to believe that the proportion of allergic in Uppsala is greater than 0.3.
A one sample test of means compares the mean of a sample to a prespecified value.
For example, we might know that the weight of a mouse on normal diet is normally distributed with mean 24.0 g and standard deviation 3 g and want to compare the weight of a sample of 10 mice on high-fat diet to the known mean value for mice on normal diet.
The hypotheses:
$$H_0: \mu = \mu_0 \\ H_1: \mu \neq \mu_0$$
The alternative hypothesis, H1, above is for the two sided hypothesis test. Other options are the one sided H1; H1 : μ > μ0 or H1 : μ < μ0.
If X ∼ N(μ,σ) (this could for example be the weight of a mouse on high-fat diet) then the sample mean $$\bar X \sim N\left(\mu, \frac{\sigma}{\sqrt{n}}\right).$$
If σ is known the test statistic
$$Z = \frac{\bar X - \mu}{\frac{\sigma}{\sqrt{n}}}$$ is normally distributed, ∼ N(0,1).
For small n and unknown σ, the test statistic
$$t = \frac{\bar X - \mu}{\frac{s}{\sqrt{n}}}$$
is t-distributed with df = n − 1 degrees of freedom.
$$H_0: \pi_1 - \pi_2 = 0\\ H_1: \pi_1 - \pi_2 \neq 0$$
Alternatively, a one sided alternative hypothesis can be used; H1 : π1 − π2 > 0 or H1 : π1 − π2 < 0.
Test statistic
$$Z = \frac{P_1 - P_2}{\sqrt{P(1-P)\left (\frac{1}{n_1} + \frac{1}{n_2}\right)}}$$
where P is the proportion in the merged sample of size n1 + n2. Z ∈ N(0,1) and p-value can be computed using the standard normal distribution.
A two sample test of means is used to determine if two population means are equal.
Two independent samples are collected (one from each population) and the means are compared. Can for example be used to determine if a treatment group is different compared to a control group, in terms of the mean of a property of interest.
The null hypothesis;
H0 : μ2 = μ1 The alternative hypothesis can either be two sided
H1 : μ2 ≠ μ1 or one sided
H1 : μ2 > μ1 or
H1 : μ2 < μ1
Assume that observations from both populations are normally distributed;
$$ \begin{aligned} X_1 \sim N(\mu_1, \sigma_1) \\ X_2 \sim N(\mu_2, \sigma_2) \end{aligned} $$ Then it follows that the sample means will also be normally distributed;
$$ \begin{aligned} \bar X_1 \sim N(\mu_1, \sigma_1/\sqrt{n_1}) \\ \bar X_2 \sim N(\mu_2, \sigma_2/\sqrt{n_2}) \end{aligned} $$
The mean difference D = X̄2 − X̄1 is thus also normally distributed:
$$D = \bar X_2 - \bar X_1 = N\left(\mu_2-\mu_1, \sqrt{\frac{\sigma_2^2}{n_2} + \frac{\sigma_1^2}{n_1}}\right)$$
If H0 is true: $$D = \bar X_2 - \bar X_1 = N\left(0, \sqrt{\frac{\sigma_2^2}{n_2} + \frac{\sigma_1^2}{n_1}}\right)$$
The test statistic: $$Z = \frac{\bar X_2 - \bar X_1}{\sqrt{\frac{\sigma_2^2}{n_2} + \frac{\sigma_1^2}{n_1}}}$$ is standard normal, i.e. Z ∼ N(0,1).
However, note that the test statistic require the standard deviations σ1 and σ2 to be known.
What if the population standard deviations are not known?
If the sample sizes are large, we can replace the known standard deviations with our sample standard deviations and according to the central limit theorem assume that
$$Z = \frac{\bar X_2 - \bar X_1}{\sqrt{\frac{s_2^2}{n_2} + \frac{s_1^2}{n_1}}} \sim N(0,1)$$
and proceed as before.
For small sample sizes the test statistic will be t-distributed.
$$t = \frac{\bar X_2 - \bar X_1}{\sqrt{\frac{s_2^2}{n_2} + \frac{s_1^2}{n_1}}}$$
For small sample sizes we can use Student’s t-test, which requires us to assume that X1 and X2 both are normally distributed and have equal variances. With these assumptions we can compute the pooled variance
$$ s_p^2 = \frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1+n_2-2} $$
and the test statistic
$$t = \frac{\bar X_1 - \bar X_2}{\sqrt{s_p^2(\frac{1}{n_1} + \frac{1}{n_2})}}$$
t is t-distributed with n1 + n2 − 2 degrees of freedom.
The t-test is implemented in R, e.g. in the function t.test in the R-package stats, both Student’s t-test with equal variances and Welch’s t-test with unequal variances.
The test of equal variance in two groups is based on the null hypothesis
H0 : σ12 = σ22
If the two samples both come from two populations with normal distributions, the sample variances
$$S_1^2 = \frac{1}{n_1-1} \sum_{i=1}^{n_1} (X_{1i}-\bar X_1)^2\\ S_2^2 = \frac{1}{n_2-1} \sum_{i=1}^{n_2} (X_{2i}-\bar X_2)^2$$
It can be shown that $\frac{(n_1-1)S_1^2}{\sigma_1^2} \sim \chi^2(n_1-1)$ and $\frac{(n_2-1)S_2^2}{\sigma_2^2} \sim \chi^2(n_2-1)$.
Hence, the test statistic for comparing the variances of two groups
$$F = \frac{S_1^2}{S_2^2}$$ is F-distributed with n1 − 1 and n2 − 1 degrees of freedom.
In R a test of equal variances can be performed using the function var.test.
| high-fat | 25 | 30 | 23 | 18 | 31 | 24 | 39 | 26 | 36 | 29 | 23 | 32 |
| ordinary | 27 | 25 | 22 | 23 | 25 | 37 | 24 | 26 | 21 | 26 | 30 | 24 |
Does high fat diet increase body weight in mice?
@ref(exr:Hb)
x <- c(154, 140, 147, 162, 172)
## sample mean
m <- mean(x)
## standard error of mean
SE <- sd(x)/sqrt(5)
## Observed value of test statistic
tobs <- (m-140)/SE
## P(t>tobs)
pt(tobs, df=4, lower.tail=FALSE)## [1] 0.0277
## p = P(t>tobs) + P(t<-tobs) = 2 * P(t>tobs)
p <- 2*pt(tobs, df=4, lower.tail=FALSE)
p## [1] 0.05541
@ref(exr:mice12) a)
# Student's t-test with pooled variances
t.test(xHF, xN, var.equal=TRUE, alternative="greater")##
## Two Sample t-test
##
## data: xHF and xN
## t = 1, df = 22, p-value = 0.2
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
## -1.469 Inf
## sample estimates:
## mean of x mean of y
## 28.00 25.83
# Unequal variances with Welch approximation to the degrees of freedom (the default)
t.test(xHF, xN, var.equal=FALSE, alternative="greater")##
## Welch Two Sample t-test
##
## data: xHF and xN
## t = 1, df = 20, p-value = 0.2
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
## -1.486 Inf
## sample estimates:
## mean of x mean of y
## 28.00 25.83
As seen in previous chapter, the sample proportion or mean is an unbiased estimate of the population values. When we only have a sample, the sample estimate will be our best guess of the population value, but it will not be without error.
If we are interested in how large proportion of the Uppsala population is allergic to pollen, we can investigate this by studying a random sample. Randomly select 100 persons in Uppsala. It is important to actually sample randomly, ideally every individual should have the same probability of being sampled.
In our sample, we observe that 42 of the 100 has a pollen allergy. Hence, the observed sample proportion is p = 0.42.
Based on this observation our point estimate of the Uppsla popultation proportion π is π ≈ p = 0.42. We know that there is a certain uncertainty in this measurement, if the experiment is repeated we would select 100 other persons and our point estimate would be slightly different.
Using bootstrap we can sample with replacement from our sample to estimate the uncertainty.
Bootstrap is to use the data we have (our sample) and sample repeatedly with replacement from this data.
Put the entire sample in an urn!
(#fig:pollenurn42)An urn model with 42 allergy (black) and 58 non-allergy (white). The black balls represent allergic and the white balls non-allergic.
Sample from the urn with replacement to compute the bootstrap distribution.
Using the bootstrap distribution the uncertainty of our estimate of π can be estimated.
The 95% bootstrap interval is [0.32, 0.52].
The bootstrap is very useful if you do not know the distribution of our sampled propery. But in our example we actually do.
A confidence interval is a type of interval estimate associated with a confidence level.
Remember that we can use the central limit theorem to show that
$$P \sim N\left(\pi, SE\right) \iff P \sim \left(\pi, \sqrt{\frac{\pi(1-\pi)}{n}}\right)$$
It follows that
$$Z = \frac{P - \pi}{SE} \sim N(0,1)$$ Based on what we know of the standard normal distribution, we can compute an interval around the population property π such that the probability that a sample property p fall within this interval is 1 − α.
$$P(-z < \frac{P - \pi}{SE} < z) = 1 - \alpha$$ For a 95% confidence interval z=1.96 (from a table of the standard normal distribution). Other confidence levels of interest include 90% (z=1.64) and 99% (z=2.58).
$$P\left(-z < \frac{P-\pi}{SE}<z\right) = \left(-z < Z <z\right) = 1-\alpha$$ We can rewrite this to
P(π−zSE<P<π+zSE) = 1 − α in other words sample fraction p will fall between π ± 1.96SE with 95% probability.
The equation can also be rewritten to P(P−zSE<π<P+zSE) = 1 − α The observed confidence interval is what we get when we replace the random variable P with our observed fraction,
p − zSE < π < p + zSE $$\pi = p \pm z SE = p \pm z \sqrt{\frac{p(1-p)}{n}}$$ The 95% confidence interval $$\pi = p \pm 1.96 \sqrt{\frac{p(1-p)}{n}}$$
A 95% confidence interval will have 95% chance to cover the true value.
Back to our example of proportion pollen allergic in Uppsala. p = 0.42 and $SE=\sqrt{\frac{p(1-p)}{n}} = 0.05$.
Hence, the 95% confidence interval is π = 0.42 ± 1.96 * 0.05 = 0.42 ± 0.092 or (0.42−0.092,0.42+0.092) = (0.32,0.52)
The confidence interval of mean can be derived similarly.
The mean of a sample of n independent and identically normal distributed observations Xi is normally distributed;
$$\bar X \sim N(\mu, \frac{\sigma}{\sqrt{n}})$$
If σ is unknown the statistic
$$\frac{\bar X - \mu}{\frac{\sigma}{\sqrt{n}}} \sim t(n-1)$$ is t-distributed with n − 1 degrees of freedom.
It follows that
$$ \begin{aligned} P\left(-t < \frac{\bar X - \mu}{\frac{\sigma}{\sqrt{n}}} < t\right) = 1 - \alpha \iff \\ P\left(\bar X - t \frac{\sigma}{\sqrt{n}} < \mu < \bar X + t \frac{\sigma}{\sqrt{n}}\right) = 1 - \alpha \end{aligned} $$
The confidence interval with confidence level 1 − α is thus;
$$\mu = \bar x \pm t \frac{s}{\sqrt{n}}$$
For a 95% confidence interval and n = 5 t is 2.78.
The t values for different values of α and degrees of freedom are tabulated and can be computed in R using the function qt.
n=5
alpha = 0.05
## t value
qt(1-alpha/2, df=n-1)## [1] 2.776
(#exr:CIHb) You measure the Hb value in 10 50-year old men and get the following observations; 145, 165, 134, 167, 158, 176, 156, 189, 143, 123 g/L.
(#exr:CIprop) In the pollen example we calculated a 95% confidence interval.
(#exr:CIscale) A scale has a normally distributed error with mean 0 and standard deviation 2.3 g. You measure a sample 10 times and observe the mean weight 43 g.
@ref(exr:CIHb)
obs <- c(145, 165, 134, 167, 158, 176, 156, 189, 143, 123)
mboot <- replicate(10000, {
x <- sample(obs, size=10, replace=TRUE)
mean(x)
})
hist(mboot)## 05% confidence interval
quantile(mboot, c(0.025, 0.975))## 2.5% 97.5%
## 143.9 167.2
(m <- mean(obs))## [1] 155.6
(v <- var(obs))## [1] 395.2
(s <- sd(obs))## [1] 19.88
n <- length(obs)
t <- qt(0.975, df=9)
##95% confidence interval
c(m - t*s/sqrt(n), m + t*s/sqrt(n))## [1] 141.4 169.8
@ref(exr:CIprop)
Calculate a 90% confidence interval instead. Or sample more people than 100.
Change the z number,
p ± z * SE
For a 90% confidence interval use z=1.64
p <- 0.42
n <- 100
SE <- sqrt(p*(1-p)/n)
z <- qnorm(0.95)
c(p - z*SE, p + z*SE)## [1] 0.3388 0.5012
z <- qnorm(0.995)
c(p - z*SE, p + z*SE)## [1] 0.2929 0.5471
@(ref:CIscale)
The measured weight is a random variable X ∼ N(μ,σ). You know that σ = 2.3, μ is the weight of the smaple.
## 95% ciónfidence interval
m <- 42
sigma <- 2.3
n <- 10
z <- qnorm(0.975)
c(m - z*sigma/sqrt(10), m + z*sigma/sqrt(10))## [1] 40.57 43.43
z <- qnorm(0.95)
c(m - z*sigma/sqrt(10), m + z*sigma/sqrt(10))## [1] 40.8 43.2
@(ref:BMCsmokers)
You observe 150 students at BMC of which 25 are smokers. Compute a 95% confidence interval for the proportion of smokers among BMC students.
p <- 25/150
n <- 150
z <-qnorm(0.975)
SE <- sqrt(p*(1-p)/n)
## 95% CI
c(p - z*SE, p + z*SE)## [1] 0.1070 0.2263
| Accept H0 | Reject H0 | |
| H0 is true | Type I error, false alarm | |
| H0 is false | Type II error, miss |
| Accept H0 | Reject H0 | |
| H0 is true | TN | FP |
| H0 is false | FN | TP |
Significance level
P(reject H0|H0 is true) = P(type I error) = α
Statistical power
P(reject H0|H1 is true) = P(reject H0|H0 is false) = 1 − P(type II error)
To achieve a family-wise error rate of ≤ α when performing m tests, declare significance and reject the null hypothesis for any test with p ≤ α/m.
Objections: too conservative
| H0 is true | H0 is false | |
| Accept H0 | TN | FN |
| Reject H0 | FP | TP |
The false discovery rate is the proportion of false positives among ‘hits’, i.e. $\frac{FP}{TP+FP}$.
Benjamini-Hochberg’s method control the FDR level, γ, when performing m independent tests, as follows:
Sometimes an adjusted significance threshold is not reported, but instead ‘adjusted’ p-values are reported.
Using Bonferroni’s method the ‘adjusted’ p-values are:
p̃i = min (mpi,1).
A feature’s adjusted p-value represents the smallest FWER at which the null hypothesis will be rejected, i.e. the feature will be deemed significant.
Benjamini-Hochberg’s ‘adjusted’ p-values are called q-values:
$q_i = \min(\frac{m}{i} p_i, 1)$
A feature’s q-value can be interpreted as the lowest FDR at which the corresponding null hypothesis will be rejected, i.e. the feature will be deemed significant.
| p-value | adj p (Bonferroni) | q-value (B-H) |
|---|---|---|
| 1.7e-08 | 0.0002 | 0.0002 |
| 5.8e-08 | 0.0006 | 0.0003 |
| 3.4e-07 | 0.0034 | 0.0011 |
| 9.1e-07 | 0.0091 | 0.0020 |
| 1e-06 | 0.0100 | 0.0020 |
| 2.4e-06 | 0.0240 | 0.0040 |
| 2.3e-05 | 0.2300 | 0.0329 |
| 3.6e-05 | 0.3600 | 0.0450 |
| 0.00022 | 1.0000 | 0.2300 |
| 0.00023 | 1.0000 | 0.2300 |
| 0.00073 | 1.0000 | 0.6636 |
| 0.0032 | 1.0000 | 1.0000 |
| 0.0045 | 1.0000 | 1.0000 |
| 0.0087 | 1.0000 | 1.0000 |
| 0.0089 | 1.0000 | 1.0000 |
| 0.012 | 1.0000 | 1.0000 |
| 0.014 | 1.0000 | 1.0000 |
| 0.045 | 1.0000 | 1.0000 |
| 0.08 | 1.0000 | 1.0000 |
| 0.23 | 1.0000 | 1.0000 |